Vietnamese Text Retrieval: Test Collection and First Experimentations
Abstract
In this paper we present the particular features of Vietnamese in word boundaries, morphology, and part of speech that must be addressed in information retrieval tasks. Our experiments show how different types of Vietnamese index terms ("tiếng", words, compound words, and combinations of words and compound words) contribute to Vietnamese text processing and retrieval. We also introduce the Vietnamese test collection on which the experiments were carried out and report the method used to construct it.

1. Vietnamese specialities

Vietnamese is a monosyllabic language written in a Latin alphabet, with accents on the vowels to create new vowel sounds such as "ă", "â", "ê", "ô", "ư". Vietnamese has six tones, which change the meaning of a word, for example: ma (phantom), má (cheek), mà (but), mả (tomb), mã (code), mạ (rice seedling). Therefore, ASCII cannot be used to encode Vietnamese characters. Instead, many character sets have been used in Vietnamese electronic text, such as ABC, TCVN, VNI, and UTF-8, with UTF-8 the most common nowadays. Consequently, the encoding may need to be normalized before the indexing phase.

Vietnamese has a special linguistic unit called "tiếng" (equivalent to the hanzi of Chinese), which is similar to a traditional morpheme in content and to a traditional syllable in form [7]. A Vietnamese word consists of one or more "tiếng" separated by spaces, for example: "sách" (book), "dữ liệu" (data), "xã hội chủ nghĩa" (socialist). Therefore, whitespace cannot be used to identify word boundaries. This is a challenge both for Vietnamese Natural Language Processing (NLP) in general and for Vietnamese text retrieval in particular. In the experimentation section we discuss in detail how different kinds of Vietnamese index terms contribute to the precision and recall of an IR system.
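The difference between indexing by "tiếng" and indexing by word can be illustrated with a minimal sketch. It is not the method used in the paper: the lexicon below is a hypothetical toy list, the function names are invented for illustration, and greedy longest matching is only one simple way to group "tiếng" into words. The sketch also assumes the text has already been converted to Unicode; it only brings Unicode text to a single canonical form (NFC) and does not transcode legacy character sets such as TCVN or VNI.

```python
import unicodedata


def normalize(text):
    """Bring Unicode text to one canonical form (NFC) and lowercase it."""
    return unicodedata.normalize("NFC", text).lower()


# Hypothetical toy lexicon of multi-"tiếng" and single-"tiếng" words.
# A real system would load a full Vietnamese dictionary instead.
LEXICON = {
    tuple(normalize(w).split())
    for w in ("sách", "dữ liệu", "xã hội", "xã hội chủ nghĩa")
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def tieng_terms(text):
    """Index terms of type 'tiếng': the whitespace-separated units."""
    return normalize(text).split()


def word_terms(text):
    """Index terms of type 'word': greedy longest match against LEXICON.

    A unit that starts no lexicon entry is emitted as a one-unit word.
    """
    units = tieng_terms(text)
    terms = []
    i = 0
    while i < len(units):
        # Try the longest candidate first, shrinking down to a single unit.
        for n in range(min(MAX_WORD_LEN, len(units) - i), 0, -1):
            candidate = tuple(units[i:i + n])
            if n == 1 or candidate in LEXICON:
                terms.append(" ".join(candidate))
                i += n
                break
    return terms
```

For the string "sách dữ liệu xã hội chủ nghĩa", `tieng_terms` yields seven terms while `word_terms` yields three, which shows why the choice of index unit changes what a query can match.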
Vietnamese words are morphologically invariant: unlike in Indo-European languages, the form of a word does not change with its grammatical role in the sentence. Therefore, lemmatization in the indexing phase is not necessary for Vietnamese words. However, there are some exceptions for which morphological normalization is needed. These exceptions arise in two cases: the first is that the vowels i and y are interchangeable in some circumstances, such as "bác sĩ" and "bác sỹ" …
Similar resources
A Basic Framework to Build a Test Collection for the Vietnamese Text Catergorization
The aim of this paper is to present a basic framework to build a test collection for a Vietnamese text categorization. The presented content includes our evaluations of some popular text categorization test collections, our researches on the requirements, the proposed model and the techniques to build the BKTexts test collection for a Vietnamese text categorization. The XML specification of bot...
Overview of the Third Text REtrieval Conference (TREC-3)
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST [Harman 1993]. The conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on a new large test collection (the TIPSTER collection). This conference became the first in a series of ongoing conferences dedicated to encouraging research in retrieva...
Overview of the Second Text Retrieval Conference (TREC-2)
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represente...
Investigating the role of different types of homograph contexts in determining the similarity between documents
Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation and thereby improving the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided to different types, wit...
Image retrieval using the combination of text-based and content-based algorithms
Image retrieval is an important research field which has received great attention in the last decades. In this paper, we present an approach for the image retrieval based on the combination of text-based and content-based features. For text-based features, keywords and for content-based features, color and texture features have been used. Query in this system contains some keywords and an input...
Publication date: 2007